了解人类情绪是智能机器人提供更好的人类机器人相互作用的关键能力。现有作品仅限于修剪视频级别的情感分类,无法找到与情感相对应的时间窗口。在本文中,我们介绍了一项新任务,称为视频中的时间情感本地化(TEL),该任务旨在检测人类的情感并将其相应的时间边界定位在带有校准字幕的未修剪视频中。与时间动作本地化相比,TEL提出了三个独特的挑战:1)情绪的时间动态极为多样; 2)情绪提示都嵌入了外观和复杂的情节中; 3)细粒度的时间注释是复杂且劳动密集型的。为了应对前两个挑战,我们提出了一个新颖的扩张上下文集成网络,该网络与粗细的两流体系结构。粗流通过建模多粒性时间上下文来捕获各种时间动力学。细流通过推理从粗流的多晶格时间上下文之间的依赖性来实现复杂的理解,并将它们自适应地集成到细粒度的视频段特征中。为了应对第三个挑战,我们引入了跨模式共识学习范式,该范式利用了对齐视频和字幕之间的固有语义共识,以实现弱监督的学习。我们为新的测试集提供了3,000个手动注释的时间边界,因此可以对TEL问题进行未来的研究进行定量评估。广泛的实验显示了我们方法对时间情绪定位的有效性。这项工作的存储库位于https://github.com/yyjmjc/temporal-emotion-localization-in-videos。
translated by 谷歌翻译
基于内容的图像检索(CIR)旨在通过同时理解示例图像和互补文本的组成来搜索目标图像,这可能会影响各种各样的现实世界应用,例如互联网搜索和时尚检索。在这种情况下,输入图像是搜索的直观上下文和背景,而相应的语言明确请求有关如何修改查询图像的特定特征以获取预期目标图像的新特征。此任务具有挑战性,因为它需要通过合并跨粒度语义更新来学习和理解复合图像文本表示。在本文中,我们通过小说\下划线{\ textbf {b}}来解决此任务\ textbf {s}} ition(\ textbf {boss})带有混合反事实训练框架,通过从两个先前被忽视的角度研究它,从而为CIR任务提供了新的启示:\ emph {隐式自下而上的自下而上的sisitiol语言表示}和sisiol语言表示}和\ emph {显式晶状体构造的明显细粒度对应}。一方面,我们利用了从底部本地特征到顶部全局语义的跨模式嵌入的隐式相互作用和组成,从而保留和转换视觉表示在多个连续步骤中以语言语义为条件的视觉表示,以进行有效的目标图像搜索。另一方面,我们设计了一种混合反事实培训策略,可以减少模型对类似查询的歧义。
translated by 谷歌翻译
基于文本的图像标题(TextCAP)需要同时对视觉内容的理解并读取图像文本以生成自然语言描述。虽然一项任务可以教导机器来了解复杂的人类环境进一步鉴于我们日常环境中的文本是全部的,但它在正常标题中提出了额外的挑战。基于文本的图像直观地包含丰富和复杂的多模式关系内容,即可以从多视图而不是单个字幕来扩散图像细节。当然,我们可以介绍额外的配对训练数据以显示图像描述的多样性,这一过程是具有额外文本的文本映射对注释的劳动密集型和耗时。基于上述洞察力,我们调查如何使用未配对的培训范例来生成专注于不同图像零件的不同标题。我们提出了多模式关系图对抗性推论(魔法)框架,用于多样化和未配对的Textcap。该框架可以自适应地构建图形之间的图像和模型复杂关系的多个多模式关系图来表示描述性分集。此外,从建模的图表中开发了一种级联的生成对抗性网络,以推断图像句子特征对齐和语言相干水平中的未配对字幕。我们验证了魔法在从图像的不同关系信息项目生成不同标题时的有效性。实验结果表明,魔法可以在不使用任何图像标题训练对的情况下产生非常有前途的结果。
translated by 谷歌翻译
当代视觉标题模型通常是幻觉的对象,其实际上并不是一种场景,因为目视错误分类或过度依赖导致视觉信息与目标词汇词之间的语义不一致。最常见的方式是鼓励标题模型将生成的对象字或短语动态链接到图像的适当区域,即接地图像标题(GIC)。然而,GIC利用辅助任务(接地对象),这些任务(接地对象)没有解决对象幻觉的关键问题,即语义不一致。在本文中,我们对上面的问题进行了一种小说 - 利用视觉和语言模式之间的语义一致性。具体而言,我们提出了与GIC的共识RRAPH表示学习框架(CGRL),其纳入接地标题管道的共识表示。通过将可视图(例如,场景图)对准到图表中的节点和边的语言图来学习共识。通过对齐的共识,标题模型可以捕获正确的语言特征和视觉相关性,然后进一步接地适当的图像区域。我们验证了我们模型的有效性,对象幻觉(-9%主席)在Flickr30k实体数据集中显着下降。此外,我们的CGR还通过多种自动度量和人体评估评估,结果表明,该方法可以同时提高图像标题(+2.9苹果酒)和接地的性能(+2.3 f1loc)。
translated by 谷歌翻译
The development of social media user stance detection and bot detection methods rely heavily on large-scale and high-quality benchmarks. However, in addition to low annotation quality, existing benchmarks generally have incomplete user relationships, suppressing graph-based account detection research. To address these issues, we propose a Multi-Relational Graph-Based Twitter Account Detection Benchmark (MGTAB), the first standardized graph-based benchmark for account detection. To our knowledge, MGTAB was built based on the largest original data in the field, with over 1.55 million users and 130 million tweets. MGTAB contains 10,199 expert-annotated users and 7 types of relationships, ensuring high-quality annotation and diversified relations. In MGTAB, we extracted the 20 user property features with the greatest information gain and user tweet features as the user features. In addition, we performed a thorough evaluation of MGTAB and other public datasets. Our experiments found that graph-based approaches are generally more effective than feature-based approaches and perform better when introducing multiple relations. By analyzing experiment results, we identify effective approaches for account detection and provide potential future research directions in this field. Our benchmark and standardized evaluation procedures are freely available at: https://github.com/GraphDetec/MGTAB.
translated by 谷歌翻译
Image Virtual try-on aims at replacing the cloth on a personal image with a garment image (in-shop clothes), which has attracted increasing attention from the multimedia and computer vision communities. Prior methods successfully preserve the character of clothing images, however, occlusion remains a pernicious effect for realistic virtual try-on. In this work, we first present a comprehensive analysis of the occlusions and categorize them into two aspects: i) Inherent-Occlusion: the ghost of the former cloth still exists in the try-on image; ii) Acquired-Occlusion: the target cloth warps to the unreasonable body part. Based on the in-depth analysis, we find that the occlusions can be simulated by a novel semantically-guided mixup module, which can generate semantic-specific occluded images that work together with the try-on images to facilitate training a de-occlusion try-on (DOC-VTON) framework. Specifically, DOC-VTON first conducts a sharpened semantic parsing on the try-on person. Aided by semantics guidance and pose prior, various complexities of texture are selectively blending with human parts in a copy-and-paste manner. Then, the Generative Module (GM) is utilized to take charge of synthesizing the final try-on image and learning to de-occlusion jointly. In comparison to the state-of-the-art methods, DOC-VTON achieves better perceptual quality by reducing occlusion effects.
translated by 谷歌翻译
Dynamic treatment regimes assign personalized treatments to patients sequentially over time based on their baseline information and time-varying covariates. In mobile health applications, these covariates are typically collected at different frequencies over a long time horizon. In this paper, we propose a deep spectral Q-learning algorithm, which integrates principal component analysis (PCA) with deep Q-learning to handle the mixed frequency data. In theory, we prove that the mean return under the estimated optimal policy converges to that under the optimal one and establish its rate of convergence. The usefulness of our proposal is further illustrated via simulations and an application to a diabetes dataset.
translated by 谷歌翻译
As natural language processing (NLP) for gender bias becomes a significant interdisciplinary topic, the prevalent data-driven techniques such as large-scale language models suffer from data inadequacy and biased corpus, especially for languages with insufficient resources such as Chinese. To this end, we propose a Chinese cOrpus foR Gender bIas Probing and Mitigation CORGI-PM, which contains 32.9k sentences with high-quality labels derived by following an annotation scheme specifically developed for gender bias in the Chinese context. Moreover, we address three challenges for automatic textual gender bias mitigation, which requires the models to detect, classify, and mitigate textual gender bias. We also conduct experiments with state-of-the-art language models to provide baselines. To our best knowledge, CORGI-PM is the first sentence-level Chinese corpus for gender bias probing and mitigation.
translated by 谷歌翻译
Off-policy evaluation (OPE) is a method for estimating the return of a target policy using some pre-collected observational data generated by a potentially different behavior policy. In some cases, there may be unmeasured variables that can confound the action-reward or action-next-state relationships, rendering many existing OPE approaches ineffective. This paper develops an instrumental variable (IV)-based method for consistent OPE in confounded Markov decision processes (MDPs). Similar to single-stage decision making, we show that IV enables us to correctly identify the target policy's value in infinite horizon settings as well. Furthermore, we propose an efficient and robust value estimator and illustrate its effectiveness through extensive simulations and analysis of real data from a world-leading short-video platform.
translated by 谷歌翻译
Off-Policy evaluation (OPE) is concerned with evaluating a new target policy using offline data generated by a potentially different behavior policy. It is critical in a number of sequential decision making problems ranging from healthcare to technology industries. Most of the work in existing literature is focused on evaluating the mean outcome of a given policy, and ignores the variability of the outcome. However, in a variety of applications, criteria other than the mean may be more sensible. For example, when the reward distribution is skewed and asymmetric, quantile-based metrics are often preferred for their robustness. In this paper, we propose a doubly-robust inference procedure for quantile OPE in sequential decision making and study its asymptotic properties. In particular, we propose utilizing state-of-the-art deep conditional generative learning methods to handle parameter-dependent nuisance function estimation. We demonstrate the advantages of this proposed estimator through both simulations and a real-world dataset from a short-video platform. In particular, we find that our proposed estimator outperforms classical OPE estimators for the mean in settings with heavy-tailed reward distributions.
translated by 谷歌翻译